Abstract

Sequence comparison across multiple organisms aids in the detection of regions un- 
der selection. However, resource limitations require a prioritization of genomes to be se- 
quenced. This prioritization should be grounded in two considerations: the lineal scope 
encompassing the biological phenomena of interest, and the optimal species within that 
scope for detecting functional elements. We introduce a statistical framework for optimal 
species subset selection, based on maximizing power to detect conserved sites. In a study 
of vertebrate species, we show that the optimal species subset is not in general the most 
evolutionarily diverged subset. Our results suggest that marsupials are prime sequencing 
candidates. 

Introduction 

Comparative genomic methods can reveal conserved regions in multiple organisms, including
functional elements undetected by single-sequence analyses [1, 2]. Individual studies have
demonstrated the effectiveness of genomic comparison for specific regions and elements
[3, 4, 5, 6, 7]. Such successes indicate that comparative considerations should play a major role
in decisions about what unsequenced species to sequence next. For comparative purposes,
sequencing choices must first of all be guided by specification of the widest range of species
sharing the functions or characters in question, which we call the lineal scope [8].¹ Boffelli et
al. [10] discuss the utility of comparisons in lineal scopes ranging from the primate clade to the
vertebrate tree.

¹ This differs from the "phylogenetic scope" of Cooper et al. [9] in that lineal scope is determined solely by a
biological trait of interest, whereas phylogenetic scope can be determined according to any considerations.

Most lineal scopes selected in practice will include far more extant species than can be se- 
quenced with today's resources. Thus, sequencing prioritization is an unavoidable issue, both 
for smaller-scale efforts targeting particular regions and for whole-genome projects, whose fo- 
cus should reflect in part the aggregate needs of comparative analyses. Few studies on compar- 
ative methods provide a quantitative framework for decision-making about what to sequence. 
An exception is the work of Sidow and others [9, 11]: given a set of sequenced organisms and
an inferred phylogeny, Cooper et al. [9] argue that decisions should be based on maximizing
additive evolutionary divergence in a phylogenetic tree.

While additive divergence captures part of the problem underlying organism choice, it fails 
to reflect the inherent tradeoff that characterizes the problem. On the one hand, the success of 
procedures for assessing conservation does depend on sufficient evolutionary distance among 
the sequences [3, 4, 12]. On the other hand, a given set of species may have diverged too far from
one another to be useful, even when orthology is preserved: in the limit of large evolutionary
distance, conservation and nonconservation are just as indistinguishable as at distance zero [13].
Furthermore, phylogenetic topology has counterintuitive effects on usefulness. 

Here, we present a decision-theoretic framework which subsumes these issues, providing a 
procedure for making systematic, quantitative choices of species to sequence. Statistical power 
is our optimality criterion for species selection. Thus, we measure the effectiveness of a species 
subset directly in terms of error rates for detecting and overlooking conservation at a single 
orthologous site. Measuring power disentangles effects due to the number of species used from 
effects due to relative evolutionary distances in the phylogeny. We illustrate these ideas theoret- 
ically, in a star phylogeny analysis, and practically, with an empirically-derived phylogeny on 
21 representative vertebrate species. The results indicate that adding the dunnart or a closely- 
related marsupial to finished and underway vertebrate sequences would most increase the power 
to detect conservation at single-nucleotide resolution. 

Setup 

We frame conservation detection in the following decision-theoretic setting. The data x are the 
nucleotides at an orthologous site across a set of species, i.e., an ungapped alignment column. 
We view these bases as corresponding to the leaves of a phylogeny with unobserved ancestral 
bases. We assume that the phylogenetic topology, the Markov substitution process along the 
branches, and the branch lengths are all known. The phylogeny induces the observed-data 
probability distribution p(x; r) as the marginal distribution on its leaves, which can be evaluated 
efficiently for any x and r [14]. The parameter r > 0 is an unknown global mutation rate shared
among all branches. We choose two threshold values r_N > r_C for r: an actual mutation rate
of at least r_N corresponds by definition to a nonconserved site, whereas a rate no more than
r_C means the site is strongly conserved. When r_N > r > r_C, the conservation is too weak to
interest us.
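As an illustration of how p(x; r) can be evaluated on a known tree, the following is a minimal sketch of the standard pruning recursion under the Jukes-Cantor process with a uniform root distribution. The three-species tree, its branch lengths, and all function names are hypothetical and chosen only for the example; they are not the phylogeny or software used in this study.

```python
import math

BASES = "ACGT"

def jc_transition(r, t):
    """4x4 Jukes-Cantor transition matrix for rate r and branch length t."""
    same = (1 + 3 * math.exp(-4 * r * t)) / 4
    diff = (1 - math.exp(-4 * r * t)) / 4
    return [[same if a == b else diff for b in range(4)] for a in range(4)]

def partial_likelihood(node, column, r):
    """Return [P(leaf data below node | node has base a)] for a in ACGT.

    A leaf is a species name (str); an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if BASES[a] == column[node] else 0.0 for a in range(4)]
    result = [1.0] * 4
    for child, t in node:
        below = partial_likelihood(child, column, r)
        P = jc_transition(r, t)
        for a in range(4):
            result[a] *= sum(P[a][b] * below[b] for b in range(4))
    return result

def column_probability(tree, column, r):
    """p(x; r): marginal probability of the leaf bases, uniform root prior."""
    return sum(0.25 * v for v in partial_likelihood(tree, column, r))

# Hypothetical tree ((human:0.1, mouse:0.3):0.2, chicken:0.6); illustration only.
tree = [([("human", 0.1), ("mouse", 0.3)], 0.2), ("chicken", 0.6)]
column = {"human": "A", "mouse": "A", "chicken": "G"}
p_conserved = column_probability(tree, column, r=1.0)     # e.g. r = r_C
p_nonconserved = column_probability(tree, column, r=2.0)  # e.g. r = r_N
```

The ratio p_conserved / p_nonconserved is then the likelihood-ratio statistic for this alignment column under the assumed rates.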

The decision-theoretic goals are now twofold. First, fixing a set of species, we wish to
select a decision rule δ(x) which declares the site either nonconserved (δ(x) = 0) or
conserved (δ(x) = 1) using only data from those species. Every nontrivial δ(x) will have
positive probability of making two mistakes: when r ≥ r_N, P_r(δ(X) = 1) is the probability it
erroneously detects conservation, and when r ≤ r_C, P_r(δ(X) = 0) is the probability it
overlooks conservation. Minimizing these probabilities guides our choice of δ(x). We formulate a
Neyman-Pearson hypothesis test [15] of the null hypothesis H_0: r ≥ r_N versus the alternative
hypothesis H_A: r ≤ r_C, stipulating a maximum allowed probability α of falsely rejecting H_0
(falsely declaring conservation). Controlling this error probability is a central concern [9].
Subject to this constraint, we find a test statistic δ(x) with large power to detect conservation, that
is, small probability of overlooking conservation. The second goal is to maximize this power
over subtrees in the larger phylogeny determined by the chosen lineal scope, such as all subtrees
on k extant species within the anthropoid clade, where k is determined by sequencing resource
limitations.

Symmetric star topology 

We initially pursue these goals in a phylogenetic setting called the symmetric star topology
(SST), where k extant species are connected to a single ancestor by branches of common length
t > 0. Choosing k and t in the SST is akin to choosing k extant species within a larger
phylogeny, such that each pair of chosen species is at a distance of approximately 2t. Hypothesis
testing in the fully-observed SST (FOSST), with known ancestral base, closely approximates
testing in the hidden-ancestor SST (HASST), the case of interest, for small to moderate t (Figure 1).
This follows because there is little uncertainty about the ancestral base at short evolutionary
distances: with high probability, it equals the most-occurring base among the descendants. The
analogy matters because we know the uniformly most-powerful testing procedure under the
FOSST: it rejects H_0 (declares conservation) for large values of the likelihood ratio statistic
p(x; r_C) / p(x; r_N) (see Appendix).

Figure 1A shows the power of the FOSST likelihood-ratio test against the particular alternative
distribution r = r_C, as t and k vary. Power against other alternatives r < r_C is larger (see
Appendix). For each t, power increases monotonically in k. However, for each k, there is a
unique power-maximizing branch length t*(k). In the Appendix we explain this in terms of
stationary Markov substitution processes. Fundamentally, it happens because both nonconserved
and conserved sites accrue mutations, and the difference in their mutation rates becomes
irrelevant as t → ∞. A consequence of this is the suboptimality of maximizing additive divergence:
for any k, the optimal tree has finite divergence k · t*(k), rather than arbitrarily large divergence.
Comparing Figures 1A and 1B shows the FOSST accurately approximates the HASST in a large
interval around t*(k) for k > 2, so the conclusion also applies to the HASST. As k increases,
t*(k) stabilizes at a nonzero value (Figure 2). Thus, the optimal divergence k · t*(k) grows
without bound as a function of k.
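A short worked calculation may help make the washing-out of the rate difference concrete. It assumes the Jukes-Cantor process with the conserved rate fixed at 1, as in the Appendix, where the probability that a single descendant matches the ancestral base is d(r, t) = (1 + 3e^{-4rt})/4. The gap in this per-species match probability between a conserved and a nonconserved site is

```latex
\[
d(1, t) - d(r_N, t) = \tfrac{3}{4}\left(e^{-4t} - e^{-4 r_N t}\right),
\]
\[
\frac{\partial}{\partial t}\bigl[d(1, t) - d(r_N, t)\bigr] = 0
\iff e^{-4t} = r_N\, e^{-4 r_N t}
\iff t = \frac{\ln r_N}{4\,(r_N - 1)} .
\]
```

The gap is zero at t = 0, tends to zero as t → ∞, and has a single interior maximum; for r_N = 2 that maximum lies at t = (ln 2)/4 ≈ 0.17. This is only a per-species heuristic: the power-maximizing t*(k) also depends on k and α and need not coincide with this value.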



Empirical power analysis 

We now explore subtree power maximization empirically, using the previously-reported CFTR
sequence data [6] on 21 representative vertebrates (Table 1). We estimated a phylogeny
(Figure 3) based on a multiple sequence alignment, as described in the Appendix. This procedure
yields phylogenies applicable to data outside the estimation region [9, 16]. We formulated the
likelihood-ratio statistic and calibrated the conserved rate threshold r_C to correspond to typical
genic conservation in the sequenced region. Having fixed the form of the testing procedure, the
goal is to maximize its power to detect conservation over subsets of size k chosen from among
the 21 species, for various values of k. This entails searching for the maximal-power family
subtree, or k-most-powerful Steiner subtree (k-MPSS), among the (21 choose k) subtrees with k leaves
(see Appendix). A Steiner subtree on k leaves is the unique smallest subtree rooted at their last
common ancestor.
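The subtree search can be sketched as a plain enumeration. The example below computes the additive divergence (total branch length) of the Steiner subtree on a chosen leaf set and finds the most divergent k-subset by brute force. The five-species tree and its branch lengths are hypothetical, invented only for illustration; they are not the fitted phylogeny of Figure 3.

```python
from itertools import combinations

# Hypothetical rooted tree: child -> (parent, branch length). Illustration only.
PARENT = {
    "human": ("primates", 0.02), "baboon": ("primates", 0.03),
    "mouse": ("rodents", 0.08), "rat": ("rodents", 0.09),
    "primates": ("eutheria", 0.10), "rodents": ("eutheria", 0.25),
    "eutheria": ("root", 0.20), "dunnart": ("root", 0.45),
}
LEAVES = ["human", "baboon", "mouse", "rat", "dunnart"]

def path_to_root(node):
    """Nodes from `node` upward, excluding the root itself."""
    path = []
    while node in PARENT:
        path.append(node)
        node = PARENT[node][0]
    return path

def steiner_divergence(leaves):
    """Total branch length of the Steiner subtree spanning `leaves`."""
    counts = {}
    for leaf in leaves:
        for node in path_to_root(leaf):
            counts[node] = counts.get(node, 0) + 1
    # The branch above a node is in the subtree iff some, but not all, chosen
    # leaves lie below it; branches above the last common ancestor drop out.
    return sum(PARENT[n][1] for n, c in counts.items() if c < len(leaves))

def most_divergent_subset(k):
    """Brute-force search over all k-leaf subsets."""
    return max(combinations(LEAVES, k), key=steiner_divergence)
```

Replacing `steiner_divergence` with a function that returns the power of the likelihood-ratio test on the same leaf set would turn this search into a search for the most powerful subset instead.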

Table 2 shows the k-MPSS (starred) in comparison to the subtree on k leaves with largest
additive divergence (the k-most-divergent Steiner subtree, or k-MDSS, daggered). The latter has
been the focus of previous work [4, 9]. These two subtree selection criteria do not coincide.
For instance, at r_N = 2, the 5-MPSS includes the dunnart, whereas the 5-MDSS instead uses the
platypus. The t-statistic on the difference in power is 2.06, so variability in the power estimate
is not a likely explanation. A more extreme example is r_N = 10: the 4-MPSS and 4-MDSS
have only one species in common, and the absolute loss in power that results from using the
4-MDSS is nearly 8.5% (t-statistic 105.7). Here, more than 4,400 subtrees have higher power
than the 4-MDSS. The disagreement at higher values of k underscores the effect of phylogenetic
topology on the detection of conservation.

We carried out a similar comparison, under the constraint that the 9 completely or partially
sequenced vertebrates in the data set appear in the subtree (Table 3). This reveals the species
whose addition to the current sequencing mix would most improve the power to detect
single-site conservation. As in Table 2, the most-powerful and most-divergent subtrees generally differ.
The pattern of disagreement is not systematic: when r_N = 5, for example, they disagree at 10
and 11 species, agree at 12 and 13, and disagree at 14. Table 2 exhibits similar properties. We
conclude that the k-MDSS is not a reliable surrogate for the k-MPSS. Table 3 reveals that the
single most beneficial species to sequence next is the dunnart (improving power by a relative
12.5%), whereas the species which adds the most evolutionary divergence is the platypus.

Discussion 

Even when the MPSS and MDSS coincide, the decision-theoretic point of view puts the focus on
the important issue: the two kinds of discrimination errors and their probabilities. The power
calculation directly measures the marginal benefit of additional sequenced species as an
increase in probability of conserved site detection. This enables us to choose a k which optimizes
the tradeoff between the expected benefit of detecting conservation and the cost of additional 
sequencing. By contrast, the additive divergence of a species set gives no direct indication of 
how a procedure using those species will fare. Since the phylogeny and substitution process 
are parameters of our procedure, their choice can be tailored to particular investigations. Our 
emphasis on single-site detection of conservation will lead to conservative power estimates in 
situations where conservation is tested for simultaneously across multiple correlated sites. 
Appendix

Symmetric star topology 
Fully observed 

Let 𝒫 = {p(x_0, x; r) : r > 0} be the family of FOSST probability mass functions indexed by the
mutation rate parameter r, for some fixed choice of descendant species count k and common
branch length t. Here x_0 is the observed ancestral base and x = (x_1, ..., x_k) are the observed
descendant bases. Write

    n(x_0, \mathbf{x}) = \sum_{i=1}^{k} \delta(x_0, x_i),    (1)

where δ(·, ·) is the Kronecker delta function. Under the Jukes-Cantor substitution process, with
its equilibrium distribution (the uniform distribution) on the ancestral base, each member of 𝒫
has the form

    p(x_0, \mathbf{x}; r) = \frac{1}{4} \prod_{i=1}^{k} \left(\frac{1 + 3e^{-4rt}}{4}\right)^{\delta(x_0, x_i)} \left(\frac{1 - e^{-4rt}}{4}\right)^{1 - \delta(x_0, x_i)}    (2)

                          = \frac{1}{4} \left(\frac{1 + 3e^{-4rt}}{4}\right)^{n(x_0, \mathbf{x})} \left(\frac{1 - e^{-4rt}}{4}\right)^{k - n(x_0, \mathbf{x})}.    (3)
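As a quick numerical check of the closed form (3), the sketch below evaluates the FOSST probability mass function and sums it over every ancestral base and every descendant configuration. The function name and the parameter values are illustrative assumptions only.

```python
import math
from itertools import product

def fosst_pmf(x0, x, r, t):
    """Equation (3): joint probability of ancestral base x0 and descendants x."""
    k = len(x)
    n = sum(1 for xi in x if xi == x0)           # equation (1): matches to x0
    same = (1 + 3 * math.exp(-4 * r * t)) / 4    # descendant equals ancestor
    diff = (1 - math.exp(-4 * r * t)) / 4        # descendant is one specific other base
    return 0.25 * same ** n * diff ** (k - n)

# Sum over all 4 ancestral bases and all 4^3 descendant configurations (k = 3).
total = sum(fosst_pmf(x0, x, r=1.0, t=0.2)
            for x0 in "ACGT" for x in product("ACGT", repeat=3))
```

The sum equals one up to floating-point rounding, because each descendant contributes a factor same + 3 * diff = 1.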



Fixing r_C = 1 entails no loss in generality, due to the nonidentifiability of the parameter pair
(r, t) in the Jukes-Cantor substitution process. Choose r_N > 1. Substituting (1) into (2) and
simplifying the ratio p(x_0, x; 1) / p(x_0, x; r_N) shows that the likelihood-ratio statistic for testing
H_0: r ≥ r_N versus H_A: r ≤ 1 in the FOSST model has the form

    \frac{(1 + 3e^{-4t})^{n(x_0, \mathbf{x})} (1 - e^{-4t})^{k - n(x_0, \mathbf{x})}}{(1 + 3e^{-4 r_N t})^{n(x_0, \mathbf{x})} (1 - e^{-4 r_N t})^{k - n(x_0, \mathbf{x})}}.    (4)

The family 𝒫 has a monotone (decreasing) likelihood ratio in the statistic n(x_0, x), because for
each pair of rate parameters r_1 > r_2 > 0, the likelihood ratio

    \frac{(1 + 3e^{-4 r_1 t})^{n} (1 - e^{-4 r_1 t})^{k - n}}{(1 + 3e^{-4 r_2 t})^{n} (1 - e^{-4 r_2 t})^{k - n}} = \left(\frac{1 + 3e^{-4 r_1 t}}{1 + 3e^{-4 r_2 t}}\right)^{n} \left(\frac{1 - e^{-4 r_1 t}}{1 - e^{-4 r_2 t}}\right)^{k - n}    (5)

is a decreasing function of n = n(x_0, x) ∈ {0, 1, ..., k}. This follows upon observing that,
when r_1 > r_2,

    \frac{1 + 3e^{-4 r_1 t}}{1 + 3e^{-4 r_2 t}} < 1 \quad\text{and}\quad \frac{1 - e^{-4 r_1 t}}{1 - e^{-4 r_2 t}} > 1.

Standard monotone likelihood-ratio theory [15] therefore implies that the likelihood-ratio test
T_α, which rejects when (4) exceeds a critical value c_α, is uniformly most powerful for testing
H_0: r ≥ r_N versus H_A: r ≤ 1 at size α. The size is attained at the particular null distribution
r = r_N.
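The monotonicity that drives this argument is easy to inspect numerically. The sketch below evaluates the FOSST likelihood-ratio statistic as a function of the match count n alone; the function name and the particular values of k, t, and the nonconserved rate are assumptions made for the example.

```python
import math

def fosst_lr(n, k, t, r_n, r_c=1.0):
    """Likelihood ratio p(.; r_c) / p(.; r_n) as a function of the match count n."""
    def term(r):
        same = 1 + 3 * math.exp(-4 * r * t)
        diff = 1 - math.exp(-4 * r * t)
        return same ** n * diff ** (k - n)
    return term(r_c) / term(r_n)

# Statistic for n = 0, 1, ..., 6 matches among k = 6 descendants.
values = [fosst_lr(n, k=6, t=0.3, r_n=2.0) for n in range(7)]
```

Because the conserved rate is the smaller one, the values increase with n, so rejecting for a large likelihood ratio is the same as rejecting for a large number of matches.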

The theory also implies that, among the alternative distributions in H_A, T_α attains its lowest
power against r = 1, yielding a lower bound on the power against any member of H_A. The
power of T_α against the particular alternative r = 1 can be written explicitly as a function of k
and t:

    p(k, t) = G_A(n_\alpha + 1; k) + \left(\frac{\alpha - G_0(n_\alpha + 1; k)}{f_0(n_\alpha; k)}\right) f_A(n_\alpha; k).    (6)

Here, f_0(·; k) and f_A(·; k) are the probability mass functions of a Bin(k, d(r, t)) random
variable with r = r_N and r = 1, respectively; G_0(·; k) and G_A(·; k) are the corresponding (càdlàg)
cumulative binomial right-tail probabilities; d(r, t) = (1 + 3 exp(−4rt))/4; and n_α is a known
critical value. To derive (6), first note that T_α is equivalent to the test which rejects H_0 when
the statistic n(x_0, x) exceeds a corresponding critical value n_α, again by virtue of the monotone
likelihood-ratio property. Both tests thus have the same power p(k, t). Let P_0 and P_A denote the
distribution of n(X_0, X) under r = r_N (the size-determining distribution) and r = 1,
respectively. Because n(x_0, x) can take on only finitely many values, we use randomized rejection to
achieve level exactly α. The critical value is n_α = min{n : P_0(n(X_0, X) > n) ≤ α}. When
n(x_0, x) > n_α, we reject. When n(x_0, x) = n_α, we reject with probability γ(α) satisfying

    P_0(n(X_0, \mathbf{X}) > n_\alpha) + \gamma(\alpha) P_0(n(X_0, \mathbf{X}) = n_\alpha) = \alpha.    (7)

This implies that setting

    \gamma(\alpha) = \frac{\alpha - P_0(n(X_0, \mathbf{X}) > n_\alpha)}{P_0(n(X_0, \mathbf{X}) = n_\alpha)}    (8)

guarantees a test with size α. It now follows that

    p(k, t) = P_A(n(X_0, \mathbf{X}) > n_\alpha) + \gamma(\alpha) P_A(n(X_0, \mathbf{X}) = n_\alpha).    (9)



Under the star topology and Jukes-Cantor substitution process, each descendant nucleotide X_i
has probability d(r, t) = (1 + 3 exp(−4rt))/4 of agreeing with X_0, independent of all other
descendants. Thus n(X_0, X) is a Bin(k, d(r, t)) random variable. The power formula (6) follows upon
substituting G_0(n_α + 1; k) for P_0(n(X_0, X) > n_α), f_0(n_α; k) for P_0(n(X_0, X) = n_α), and
similarly for P_A.

The power formula (6) involves only known constants and binomial probabilities. The latter can be
evaluated quickly to desired accuracy [18]. This allows us to compute p(k, t) for many choices
of k and t, leading to the power curves in Figure 1A. The kinks in each power curve correspond
to settings of t at which the critical value of the likelihood-ratio test changes. The locations
of the kinks are easily determined, and the power curves are highly smooth between kinks.
Thus, we can find t*(k) and p*(k) rapidly using a numerical optimization routine (Figure 1A,
Figure 2A).
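The FOSST power computation described above can be sketched in a few lines of standard-library Python. The version below computes the critical value, the randomization probability, and the power against r = 1, then locates a maximizing branch length by a simple grid search. The grid search stands in for the numerical optimization routine, which the text does not specify, and the function names and default parameter values are assumptions.

```python
import math

def match_prob(r, t):
    """d(r, t): probability that a descendant base equals the ancestral base."""
    return (1 + 3 * math.exp(-4 * r * t)) / 4

def binom_pmf(n, k, p):
    return math.comb(k, n) * p ** n * (1 - p) ** (k - n)

def right_tail(n, k, p):
    """P(N >= n) for N ~ Bin(k, p)."""
    return sum(binom_pmf(m, k, p) for m in range(max(n, 0), k + 1))

def fosst_power(k, t, r_n=2.0, alpha=0.05):
    """Power against r = 1 of the randomized size-alpha test on the match count."""
    p0, pa = match_prob(r_n, t), match_prob(1.0, t)
    # Critical value: smallest n with P0(N > n) <= alpha.
    n_alpha = next(n for n in range(k + 1) if right_tail(n + 1, k, p0) <= alpha)
    # Randomization probability at N = n_alpha, so that the size is exactly alpha.
    gamma = (alpha - right_tail(n_alpha + 1, k, p0)) / binom_pmf(n_alpha, k, p0)
    return right_tail(n_alpha + 1, k, pa) + gamma * binom_pmf(n_alpha, k, pa)

def best_branch_length(k, grid=None, **kwargs):
    """Grid-search approximation to the power-maximizing branch length."""
    grid = grid or [i / 1000 for i in range(1, 2001)]
    return max(grid, key=lambda t: fosst_power(k, t, **kwargs))
```

At t = 0 and at very large t the null and alternative match probabilities coincide, so `fosst_power` returns α; in between it exceeds α, which is why the grid search finds an interior maximum.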

Hidden ancestor 

Under the HASST model and Jukes-Cantor process, the likelihood-ratio statistic has the form

    \frac{\sum_{x_0} (1 + 3e^{-4t})^{n(x_0, \mathbf{x})} (1 - e^{-4t})^{k - n(x_0, \mathbf{x})}}{\sum_{x_0} (1 + 3e^{-4 r_N t})^{n(x_0, \mathbf{x})} (1 - e^{-4 r_N t})^{k - n(x_0, \mathbf{x})}},    (10)

where each sum runs over the four possible ancestral bases x_0. This is more difficult to deal
with than (4). It is clear that (10) depends only on the occurrence
counts of the four different bases, not on the leaf configuration which gives rise to the counts.
Indeed, (10) is invariant when the bases associated with the counts are permuted. This means
that there are only as many distinct values of (10) as there are integer partitions of k into four
parts, with partition values of zero allowed. The number of leaf configurations corresponding
to each integer partition is an easy combinatorial quantity. We can generate all the integer
partitions and evaluate the HASST probability mass function at each one quickly, even for k as
large as 100.

Together, these facts allow us to compute the exact null distribution (r = r_N) and alternative
distribution (r = 1) of (10), for each required setting of (α, r_N, k, t). This yields the power of the
HASST likelihood-ratio test, using formulas (8) and (9) with the HASST distribution functions
substituted for P_0 and P_A. We then maximize each curve p(k, ·) by brute force to determine
t*(k) and p*(k) (Figure 1B, Figure 2B).

Existence of maximal power

We can explain the existence of a power-maximizing common branch length t* under the
Jukes-Cantor process, and more generally under any continuous stationary Markov process, as
follows. Fix k. At evolutionary distance zero (t = 0), the distribution p(x; r) in a symmetric star
topology is the same for every mutation rate r. Thus the null and alternative hypotheses
coincide. In this circumstance, the power is easily seen to equal α. In the limit of evolutionary time,
as t → ∞, the distribution of each descendant base approaches the process's stationary
distribution, independent of the ancestral base. Since the stationary distribution does not involve the
rate r, all conserved and nonconserved distributions converge to the same limit. The limiting
power in t is therefore again α. The fact that power begins at α when t = 0 and approaches α
as t → ∞, together with the fact that power is continuous in t and greater than α on (0, ∞),
implies a maximal power must be attained by some finite t*(k).
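The hidden-ancestor computation described in the Hidden ancestor subsection can be sketched as follows: enumerate the integer partitions of k into four parts, weight each by its number of leaf configurations, and run a randomized likelihood-ratio test on the resulting exact distributions. This is a minimal illustration under the Jukes-Cantor process with the conserved rate fixed at 1; the function names are hypothetical and the code requires t > 0.

```python
import math
from collections import Counter

def partitions4(k):
    """Integer partitions of k into four parts c1 >= c2 >= c3 >= c4 >= 0."""
    for c1 in range(k, -1, -1):
        for c2 in range(min(c1, k - c1), -1, -1):
            for c3 in range(min(c2, k - c1 - c2), -1, -1):
                c4 = k - c1 - c2 - c3
                if c4 <= c3:
                    yield (c1, c2, c3, c4)

def configurations(counts):
    """Number of leaf configurations whose base counts form this partition."""
    multinomial = math.factorial(sum(counts))
    for c in counts:
        multinomial //= math.factorial(c)
    labelings = math.factorial(4)          # ways to assign the counts to A, C, G, T
    for m in Counter(counts).values():
        labelings //= math.factorial(m)
    return multinomial * labelings

def hasst_pmf(counts, r, t):
    """Probability of one leaf configuration, summed over the hidden ancestral base."""
    k = sum(counts)
    same = (1 + 3 * math.exp(-4 * r * t)) / 4
    diff = (1 - math.exp(-4 * r * t)) / 4
    return 0.25 * sum(same ** c * diff ** (k - c) for c in counts)

def hasst_power(k, t, r_n=2.0, alpha=0.05):
    """Exact power against r = 1 of the randomized size-alpha likelihood-ratio test."""
    rows = []
    for counts in partitions4(k):
        m = configurations(counts)
        null, alt = m * hasst_pmf(counts, r_n, t), m * hasst_pmf(counts, 1.0, t)
        rows.append((alt / null, null, alt))
    rows.sort(reverse=True)                # largest likelihood ratio first
    size = power = 0.0
    for _, null, alt in rows:
        if size + null <= alpha:           # reject outright
            size, power = size + null, power + alt
        else:                              # randomize on the boundary value
            power += (alpha - size) / null * alt
            break
    return power
```

Evaluating `hasst_power` at a very large t returns α, and at moderate t it exceeds α, which matches the argument for the existence of a power-maximizing branch length.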

Empirical power analysis 

We constructed a multiple alignment of 21 sequences (Table 1) from the CFTR data set
using MAVID [19]. We then used maximum likelihood [20, 14] to fit a phylogenetic tree topology
and branch lengths (Figure 3) to the alignment. Both the phylogeny estimation and
subsequent power analysis employed the nucleotide substitution process of Felsenstein [21], using a
transition-transversion ratio of 2:1 and a uniform equilibrium nucleotide distribution. Branch
lengths {t_j} are measured in expected number of substitutions at an exonic aligned site.

The phylogenetic topology of Figure 3 differs in a few ways from estimates based on
considerations of large-scale indel mutations and morphology, for example in its placement of the
chicken and platypus. At issue here, however, is its suitability for a single-site power
analysis under a substitutional mutation model. We chose our tree estimation procedure to obtain a
phylogeny compatible with the data and directed to this goal.

Finding the k-MPSS in a phylogeny is a combinatorial optimization problem, which we
solve directly in small to moderate-sized cases by evaluating the power of the likelihood-ratio
test based on every candidate Steiner subtree (Table 2). We can also solve the problem directly
for larger k, by constraining the species at many of the leaves in the subtree (Table 3). In order
to compute power with a particular subtree, we used a Monte Carlo strategy. For each setting
of r_N, with α = 0.05, we generated 100,000 realizations from the null (r = r_N) and alternative
(r = 1) distributions on the leaves of the full phylogeny. This induced null and alternative
empirical distributions on the leaves of every possible subtree, from which we obtained
approximations to the true null and alternative distributions of the likelihood-ratio test. These yielded
approximate critical values as well as power estimates. We repeated the process of simulation
and subtree power estimation ten times for each parameter setting; Tables 2 and 3 show averages
and standard errors across repetitions.
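The Monte Carlo strategy can be illustrated on a much simpler model than the one used here. The sketch below simulates alignment columns on a hidden-ancestor star tree under the Jukes-Cantor process, takes the critical value from the simulated null distribution, and estimates power from the simulated alternative. The star tree, the substitution model, the number of draws, and all names are assumptions made for the example; the analysis in this paper simulates on the fitted 21-species phylogeny.

```python
import math
import random

def simulate_column(k, r, t, rng):
    """Draw k leaf bases (coded 0..3) on a star tree with a hidden uniform root."""
    root = rng.randrange(4)
    same = (1 + 3 * math.exp(-4 * r * t)) / 4
    return [root if rng.random() < same else (root + rng.randrange(1, 4)) % 4
            for _ in range(k)]

def log_lr(column, t, r_n):
    """log p(x; 1) - log p(x; r_n) on the star tree (requires t > 0)."""
    k = len(column)
    def loglik(r):
        same = (1 + 3 * math.exp(-4 * r * t)) / 4
        diff = (1 - math.exp(-4 * r * t)) / 4
        return math.log(sum(same ** column.count(a) * diff ** (k - column.count(a))
                            for a in range(4)))
    return loglik(1.0) - loglik(r_n)

def monte_carlo_power(k, t, r_n=2.0, alpha=0.05, draws=20000, seed=0):
    """Estimate power of the likelihood-ratio test from simulated columns."""
    rng = random.Random(seed)
    null = sorted(log_lr(simulate_column(k, r_n, t, rng), t, r_n) for _ in range(draws))
    alt = [log_lr(simulate_column(k, 1.0, t, rng), t, r_n) for _ in range(draws)]
    critical = null[int((1 - alpha) * draws)]       # empirical upper-alpha quantile
    # Rejecting only above the quantile (no randomization) keeps the size at or
    # below alpha, so this estimate is conservative for a discrete statistic.
    return sum(v > critical for v in alt) / draws
```

Repeating the call with different seeds and averaging gives a mean and standard error of the kind reported in Tables 2 and 3.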
No.  Species
 1   Baboon
 2   Cat
 3   Chicken
 4   Chimpanzee
 5   Cow
 6   Dog
 7   Dunnart
 8   Fugu
 9   Hedgehog
10   Horse
11   Human
12   Lemur
13   Macaque
14   Mouse
15   Opossum
16   Pig
17   Platypus
18   Rabbit
19   Rat
20   Tetraodon
21   Zebrafish

Table 1: The 21 species whose CFTR region sequence data underlie the empirical power analysis.
r_N  Size  Species (* = MPSS, † = MDSS)                     Power % (SE)        t vs. MPSS   Rank
2    2     Rat, Zebrafish *†                                6.79 (0.01)                      1.3
     3     Rat, Zebrafish, Chicken *†                       [illegible] (0.01)               [illegible]
     4     Rat, Zebrafish, Chicken, Dog *†                  9.61 (0.02)                      3.3
     5     Rat, Zebrafish, Chicken, Dog, Dunnart *          10.88 (0.03)                     4.4
           Rat, Zebrafish, Chicken, Dog, Platypus †         10.80 (0.02)        2.06         21.7
5    2     Rat, Zebrafish *†                                10.60 (0.02)                     3.2
     3     Rat, Zebrafish, Chicken *†                       21.61 (0.06)                     1.8
     4     Rat, Zebrafish, Chicken, Dog *†                  39.33 (0.17)                     5.2
     5     Rabbit, Cat, Dunnart, Chicken, Hedgehog *        49.96 (0.07)                     12.2
           Rat, Zebrafish, Chicken, Dog, Platypus †         47.31 (0.07)        25.82        3894.4
10   2     Dunnart, Lemur *                                 13.30 (0.03)                     21.0
           Rat, Zebrafish †                                 12.67 (0.02)        16.67        153.0
     3     Dunnart, Cat, Zebrafish *                        37.53 (0.11)                     10.4
           Rat, Zebrafish, Chicken †                        36.83 (0.12)        4.13         77.2
     4     Dunnart, Chicken, Hedgehog, Opossum *            64.69 (0.05)                     4.4
           Rat, Zebrafish, Chicken, Dog †                   56.21 (0.06)        105.70       4439.3
     5     Macaque, Lemur, Dog, Cow, Pig *                  69.75 (0.11)                     8.6
           Rat, Zebrafish, Chicken, Dog, Platypus †         66.86 (0.07)        22.28        4867.4

Table 2: The k-MPSS and k-MDSS as a function of the nonconserved rate r_N and the size k of the
subtree, with α = 0.05 throughout. Results are across 10 repetitions of the Monte Carlo power
estimation procedure (see Appendix). The last three columns display the average power (and
standard error), the t-statistic for the power difference between the k-MDSS and the k-MPSS
(in cases where they differ), and the average power ranking (among all subtrees). Since r_C is
calibrated to exonic conservation, the settings of r_N range from a neutral rate (r_N = 2) [16]
towards extreme single-site mutability.
r_N  Size  Species (* = MPSS, † = MDSS)                     Power % (SE)        t vs. MPSS   Rank
2    9     {clamped species only}                           [illegible]
     10    [illegible] *                                    14.42 (0.04)                     1.1
           Platypus †                                       [illegible]         [illegible]  3.4
     11    Dunnart, Platypus *                              16.08 (0.05)                     1.6
           Platypus, Hedgehog †                             15.85 (0.04)        3.62         6.2
     12    Dunnart, Platypus, Hedgehog *†                   17.88 (0.06)                     1.5
     13    Dunnart, Platypus, Hedgehog, Rabbit *†           19.80 (0.08)                     1.1
     14    Dunnart, Platypus, Hedgehog, Rabbit, Cow *†      21.41 (0.08)                     1.6
5    9     {clamped species only}                           [illegible]
     10    [rows illegible]
     11    [illegible] *                                    [illegible]                      [illegible]
           Platypus, Hedgehog †                             70.54 (0.06)        4.74         14.6
     12    Dunnart, Platypus, Hedgehog *†                   72.77 (0.08)                     1.2
     13    Dunnart, Platypus, Hedgehog, Rabbit *†           76.02 (0.13)                     1.0
     14    Dunnart, Platypus, Hedgehog, Rabbit, Opossum *   80.41 (0.10)                     2.2
           Dunnart, Platypus, Hedgehog, Rabbit, Cow †       80.08 (0.14)        1.88         2.1
10   9     {clamped species only}                           86.61 (0.06)
     10    Platypus *†                                      91.67 (0.06)                     1.3
     11    Dunnart, Opossum *                               94.07 (0.02)                     3.3
           Platypus, Hedgehog †                             93.96 (0.03)        2.66         10.7
     12    Dunnart, Platypus, Rabbit *                      95.84 (0.03)                     2.4
           Dunnart, Platypus, Hedgehog †                    95.79 (0.30)        1.30         4.4
     13    Dunnart, Platypus, Rabbit, Opossum *             97.31 (0.02)                     4.6
           Dunnart, Platypus, Rabbit, Hedgehog †            97.29 (0.02)        0.85         6.6
     14    Dunnart, Platypus, Rabbit, Hedgehog, Opossum *   97.99 (0.01)                     2.4
           Dunnart, Platypus, Rabbit, Hedgehog, Cow †       97.95 (0.02)        1.83         7.6

Table 3: The k-MPSS and k-MDSS, under the constraint that the following nine species are
included in the subtree: human, mouse, rat, chimpanzee, dog, chicken, fugu, zebrafish, and
tetraodon. The species column lists only the species added to these nine. The scheme of the
table is the same as Table 2.
Figure 1: Power to detect conservation as a function of common branch length for the
fully-observed (A) and hidden-ancestor (B) SSTs, using r_C = 1, r_N = 2, and α = 0.05. Each power
curve corresponds to an even number k of observed descendant species, from two (bottommost
curve) to 100 (topmost). The maximum power attained for each k is indicated by a grey dot.
The power analysis uses the Jukes-Cantor substitution process; power curves are computed
analytically (see Appendix). Power curves computed with other values of r_N and α remain
qualitatively the same (not shown).

Figure 2: The optimal common branch length t*(k) in the fully-observed (A) and
hidden-ancestor (B) SSTs, as a function of the number of descendant species k. Each black curve
uses the indicated nonconserved rate r_N = 2, 3, 5, 7 with α = 0.05; grey curves are analogous
with α = 0.01. As k increases, t*(k) stabilizes at a value depending on r_N but not α. For the
larger values of r_N, the curves are terminated when power reaches 99.9%.

Figure 3: The 21-species phylogenetic tree estimate used in the empirical power analysis.